\\epsilon)$ for each $n$ for some $\\epsilon\\in (0,1)$. For $X_1$, we have $P(X_1>\\epsilon)=1$ because we already chose $\\epsilon$ in the interval covered by $X_1$. For $X_2$, we have $P(X_2>\\epsilon)=1/2$, for $X_3$, we have $P(X_3>\\epsilon)=1/2$, and so on. This produces the following sequence:\n",
"\n",
"$$\n",
"(1,\\frac{1}{2},\\frac{1}{2},\\frac{1}{3},\\frac{1}{3},\\ldots).\n",
"$$\n",
"\n",
"The limit of the sequence is zero, so $X_n \\overset{P}{\\to} 0$. However, for every $x\\in (0,1)$, the sequence of function values of $X_n(x)$ consists of infinitely many zeros and ones (remember that indicator functions can evaluate to either zero or one). Thus, the set of $x$ for which the sequence $X_n(x)$ converges is empty because the sequence bounces between zero and one. This means that almost sure convergence fails even though we have convergence in probability. The key distinction is that convergence in probability considers the convergence of a sequence of probabilities whereas almost sure convergence is concerned about the sequence of values of the random variables over sets of events that *fill out* the underlying probability space entirely (i.e., with probability one).\n",
"\n",
"This is a good example so let's see if we can make it concrete with some\n",
"Python. The following is a function to compute the different subintervals,"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"make_interval= lambda n: np.array([(i/n,j/n) for i,j in zip(range(n+1),range(1,n+1))])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Now, we can use this function to create a Numpy\n",
"array of intervals, as in the example,"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[ 0. 1. ]\n",
" [ 0. 0.5 ]\n",
" [ 0.5 1. ]\n",
" [ 0. 0.33333333]\n",
" [ 0.33333333 0.66666667]\n",
" [ 0.66666667 1. ]\n",
" [ 0. 0.25 ]\n",
" [ 0.25 0.5 ]\n",
" [ 0.5 0.75 ]\n",
" [ 0.75 1. ]]\n"
]
}
],
"source": [
"intervals= np.vstack([make_interval(i) for i in range(1,5)])\n",
"print(intervals)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" The following function computes the bit string in our example,\n",
"$\\lbrace X_1,X_2,\\ldots,X_n \\rbrace$,"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([1, 0, 1, 0, 1, 0, 0, 0, 1, 0])"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bits= lambda u:((intervals[:,0] < u) & (u<=intervals[:,1])).astype(int)\n",
"bits(u.rvs())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Now that we have the individual bit strings, to show convergence we\n",
"want to show that the probability of each entry goes to a limit. For example,\n",
"using ten realizations,"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[1 1 0 1 0 0 1 0 0 0]\n",
" [1 0 1 0 0 1 0 0 0 1]\n",
" [1 0 1 0 1 0 0 0 1 0]\n",
" [1 1 0 1 0 0 1 0 0 0]\n",
" [1 0 1 0 1 0 0 0 1 0]\n",
" [1 0 1 0 1 0 0 0 1 0]\n",
" [1 1 0 0 1 0 0 1 0 0]\n",
" [1 1 0 1 0 0 1 0 0 0]\n",
" [1 0 1 0 1 0 0 0 1 0]\n",
" [1 1 0 1 0 0 1 0 0 0]]\n"
]
}
],
"source": [
"print (np.vstack([bits(u.rvs()) for i in range(10)]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We want the limiting probability of a one in each row to convert to a limit. We can estimate this over 1000 realizations using the following code,"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([ 1. , 0.494, 0.506, 0.311, 0.346, 0.343, 0.225, 0.269,\n",
" 0.247, 0.259])"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.vstack([bits(u.rvs()) for i in range(1000)]).mean(axis=0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that these entries should approach the $(1,\\frac{1}{2},\\frac{1}{2},\\frac{1}{3},\\frac{1}{3},\\ldots)$ sequence we found earlier. The following figure shows the convergence of these probabilities for a large number of intervals.\n",
"\n",
"
\n",
"\n",
"Eventually, the probability shown on this graph will decrease to zero with large enough $n$. Again, note that the individual sequences of zeros and ones do not converge, but the probabilities of these sequences converge. This is the key difference between almost sure convergence and convergence in probability. Thus, convergence in probability does *not* imply almost sure convergence. Conversely, almost sure convergence *does* imply convergence in probability.\n",
"\n",
"The following notation should help emphasize the difference between almost sure convergence and convergence in probability, respectively,\n",
"\n",
"$$\n",
"\\begin{align*}\n",
"P\\left(\\lim_{n\\rightarrow \\infty} \\vert X_n-X\\vert < \\epsilon\\right)&=1 \\texttt{(almost sure convergence)} \\\\\\\n",
"\\lim_{n\\rightarrow \\infty} P(\\vert X_n-X\\vert < \\epsilon)&=1 \\texttt{(convergence in probability)}\n",
"\\end{align*}\n",
"$$"
]
},
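{
"cell_type": "markdown",
"metadata": {},
"source": [
"For contrast, here is a case where almost sure convergence *does* hold. Consider $X_n$ defined as the indicator function of the interval $(0,1/n)$ evaluated at the same uniform $u$. For any fixed draw of $u$, we have $X_n(u)=0$ for all $n>1/u$, so the sequence of values settles down at every sample point and $X_n \\overset{as}{\\to} 0$. The following is a minimal sketch of this (an added illustration, assuming Numpy is available),"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"np.random.seed(123)                # seed chosen only for reproducibility\n",
"n = np.arange(1,101)               # indices for X_1,...,X_100\n",
"for us in np.random.rand(5):       # a few fixed draws of u\n",
"    xn = (us < 1./n).astype(int)   # X_n(u)=1 only if u lies in (0,1/n)\n",
"    print(us, xn[-10:])            # the tail is all zeros for each fixed u"
]
},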
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Convergence in Distribution\n",
"\n",
"So far, we have been discussing convergence in terms of sequences of probabilities or sequences of values taken by the random variable. By contrast, the next major kind of convergence is *convergence in distribution*, where,\n",
"\n",
"$$\n",
"\\lim_{n \\to \\infty} F_n(t) = F(t)\n",
"$$\n",
"\n",
" for all $t$ for which $F$ is continuous and $F$ is the\n",
"cumulative density function. For this case, convergence is only\n",
"concerned with the cumulative density function, written as $X_n\n",
"\\overset{d}{\\to} X$. \n",
"\n",
"**Example.** To develop some intuition about this kind of convergence,\n",
"consider a sequence of $X_n$ Bernoulli random variables. Furthermore,\n",
"suppose these are all really just the same random variable $X$.\n",
"Trivially, $X_n \\overset{d}{\\to} X$. Now, suppose we define $Y=1-X$,\n",
"which means that $Y$ has the same distribution as $X$. Thus, $X_n\n",
"\\overset{d}{\\to} Y$. By contrast, because $\\vert X_n - Y\\vert=1$ for all\n",
"$n$, we can never have almost sure convergence or convergence in\n",
"probability. Thus, convergence in distribution is the weakest\n",
"of the three forms of convergence in the sense that it is implied by\n",
"the other two, but implies neither of the two.\n",
"\n",
"As another striking example, we could have $Y_n \\overset{d}{\\to} Z$ where $Z\n",
"\\sim \\mathcal{N}(0,1)$, but we could also have $Y_n \\overset{d}{\\to} -Z$.\n",
"That is, $Y_n$ could converge in distribution to either $Z$ or $-Z$. This\n",
"may seem ambiguous, but this kind of convergence is practically very useful\n",
"because it allows for complicated distributions to be approximated by\n",
"simpler distributions. "
]
},
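{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Bernoulli example is easy to check numerically when $p=1/2$. The following sketch (an added illustration, assuming `scipy.stats` is available) draws samples of $X$, forms $Y=1-X$, and shows that the two empirical distributions agree even though $\\vert X - Y\\vert=1$ for every sample,"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"from scipy import stats\n",
"np.random.seed(1234)\n",
"xs = stats.bernoulli(0.5).rvs(1000)   # samples of X with p=1/2\n",
"ys = 1 - xs                           # Y=1-X has the same distribution as X\n",
"print(xs.mean(), ys.mean())           # empirical P(X=1) and P(Y=1), both near 1/2\n",
"print(abs(xs - ys).min())             # yet |X-Y|=1 for every sample"
]
},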
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Limit Theorems\n",
"\n",
"\n",
"Now that we have all of these notions of convergence, we can apply them to\n",
"different situations and see what kinds of claims we can construct from them.\n",
"\n",
"**Weak Law of Large Numbers.** Let $\\lbrace X_1,X_2,\\ldots,X_n \\rbrace$ be an\n",
"iid set of random variables with finite mean $\\mathbb{E}(X_k)=\\mu$ and finite\n",
"variance. Let $\\overline{X}_n = \\frac{1}{n}\\sum_k X_k$. Then, we have\n",
"$\\overline{X}_n \\overset{P}{\\to} \\mu$. This result is important because we\n",
"frequently estimate parameters using an averaging process of some kind. This\n",
"basically justifies this in terms of convergence in probability. Informally,\n",
"this means that the distribution of $\\overline{X}_n$ becomes\n",
"concentrated around $\\mu$ as $n\\rightarrow\\infty$.\n",
"\n",
"**Strong Law of Large Numbers.** Let $\\lbrace X_1,X_2,\\ldots,\\rbrace$ be an\n",
"iid set of random variables. Suppose that $\\mu=\\mathbb{E}\\vert\n",
"X_i\\vert<\\infty$, then $\\overline{X}_n \\overset{as}{\\to} \\mu$. The reason this\n",
"is called the strong law is that it implies the weak law because almost sure\n",
"convergence implies convergence in probability. The so-called Komogorov\n",
"criterion gives the convergence of the following,\n",
"\n",
"$$\n",
"\\sum_k \\frac{\\sigma_k^2}{k^2}\n",
"$$\n",
"\n",
" as a sufficient condition for concluding that the Strong Law applies\n",
"to the sequence $ \\lbrace X_k \\rbrace$ with corresponding $\\lbrace \\sigma_k^2\n",
"\\rbrace$.\n",
"\n",
"As an example, consider an infinite sequence of Bernoulli trials with $X_i=1$\n",
"if the $i^{th}$ trial is successful. Then $\\overline{X}_n$ is the relative\n",
"frequency of successes in $n$ trials and $\\mathbb{E}(X_i)$ is the\n",
"probability $p$ of success on the $i^{th}$ trial. With all that established,\n",
"the Weak Law says only that if we consider a sufficiently large and fixed\n",
"$n$, the probability that the relative frequency will converge to $p$ is\n",
"guaranteed. The Strong Law states that if we regard the observation of all\n",
"the infinite $\\lbrace X_i \\rbrace$ as one performance of the experiment, the\n",
"relative frequency of successes will almost surely converge to $p$. The\n",
"difference between the Strong Law and the Weak Law of large numbers is\n",
"subtle and rarely arises in practical applications of probability theory.\n",
"\n",
"**Central Limit Theorem.** Although the Weak Law of Large Numbers tells us\n",
"that the distribution of $\\overline{X}_n$ becomes concentrated around $\\mu$, it\n",
"does not tell us what that distribution is. The Central Limit Theorem (CLT)\n",
"says that $\\overline{X}_n$ has a distribution that is approximately Normal\n",
"with mean $\\mu$ and variance $\\sigma^2/n$. Amazingly, nothing is assumed\n",
"about the distribution of $X_i$, except the existence\n",
"of the mean and variance. The following is the Central Limit Theorem:\n",
"Let $\\lbrace X_1,X_2,\\ldots,X_n \\rbrace$ be iid with mean $\\mu$ and\n",
"variance $\\sigma^2$. Then,\n",
"\n",
"$$\n",
"Z_n = \\frac{\\sqrt{n}(\\overline{X}_n-\\mu)}{\\sigma} \\overset{P}{\\longrightarrow} Z\\sim\\mathcal{N}(0,1)\n",
"$$\n",
"\n",
" The loose interpretation of the Central Limit Theorem is that\n",
"$\\overline{X}_n$ can be legitimately approximated by a Normal distribution.\n",
"Because we are talking about convergence in probability here, claims\n",
"about probability are legitimized, not claims about the random variable\n",
"itself. Intuitively, this shows that normality arises from sums of small,\n",
"independent disturbances of finite variance. Technically, the finite\n",
"variance assumption is essential for normality. Although the Central Limit\n",
"Theorem provides a powerful, general approximation, the quality of the\n",
"approximation for a particular situation still depends on the original\n",
"(usually unknown) distribution."
]
},
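{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see the Central Limit Theorem at work, the following sketch (an added illustration, using exponential samples as an arbitrary non-normal choice with $\\mu=\\sigma=1$) standardizes the sample mean and compares an empirical tail probability to the standard normal tail,"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"from scipy import stats\n",
"np.random.seed(10)\n",
"n = 1000\n",
"samples = np.random.exponential(size=(5000, n))       # 5000 realizations of n exponential(1) samples\n",
"zn = np.sqrt(n)*(samples.mean(axis=1) - 1.)/1.        # standardize: mu=1, sigma=1 for exponential(1)\n",
"print((zn > 1.96).mean(), 1 - stats.norm.cdf(1.96))   # empirical vs normal tail, both near 0.025"
]
},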
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Useful Inequalities\n",
"\n",
"In practice, few quantities can be analytically calculated. Some knowledge\n",
"of bounding inequalities helps find the ballpark for potential solutions. This\n",
"sections discusses three key inequalities that are important for \n",
"probability, statistics, and machine learning.\n",
"\n",
"## Markov's Inequality\n",
"\n",
"Let $X$ be a non-negative random variable\n",
"and suppose that $\\mathbb{E}(X) < \\infty$. Then,\n",
"for any $t>0$,"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$$\n",
"\\mathbb{P}(X>t)\\leq \\frac{\\mathbb{E}(X)}{t}\n",
"$$\n",
"\n",
" This is a foundational inequality that is\n",
"used as a stepping stone to other inequalities. It is easy\n",
"to prove. Because $X>0$, we have the following,\n",
"\n",
"$$\n",
"\\begin{align*}\n",
"\\mathbb{E}(X)&=\\int_0^\\infty x f_x(x)dx =\\underbrace{\\int_0^t x f_x(x)dx}_{\\text{omit this}}+\\int_t^\\infty x f_x(x)dx \\\\\\ \n",
" &\\ge\\int_t^\\infty x f_x(x)dx \\ge t\\int_t^\\infty x f_x(x)dx = t \\mathbb{P}(X>t)\n",
"\\end{align*}\n",
"$$\n",
"\n",
" The step that establishes the inequality is the part where the\n",
"$\\int_0^t x f_x(x)dx$ is omitted. For a particular $f_x(x)$ that my be\n",
"concentrated around the $[0,t]$ interval, this could be a lot to throw out.\n",
"For that reason, the Markov Inequality is considered a *loose* inequality,\n",
"meaning that there is a substantial gap between both sides of the inequality.\n",
"For example, as shown in [Figure](#fig:ProbabilityInequalities_001), the\n",
"$\\chi^2$ distribution has a lot of its mass on the left, which would be omitted\n",
"in the Markov Inequality. [Figure](#fig:ProbabilityInequalities_002) shows\n",
"the two curves established by the Markov Inequality. The gray shaded region is\n",
"the gap between the two terms and indicates that looseness of the bound\n",
"(fatter shaded region) for this case.\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"The $\\chi_1^2$ density has much of its weight on the left, which is excluded in the establishment of the Markov Inequality.
\n",
"
\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"The shaded area shows the region between the curves on either side of the Markov Inequality.
\n",
"
\n",
"\n",
"\n",
"\n",
"\n",
"## Chebyshev's Inequality\n",
"\n",
"Chebyshev's Inequality drops out directly from the Markov Inequality. Let\n",
"$\\mu=\\mathbb{E}(X)$ and $\\sigma^2=\\mathbb{V}(X)$. Then, we have\n",
"\n",
"$$\n",
"\\mathbb{P}(\\vert X-\\mu\\vert \\ge t) \\le \\frac{\\sigma^2}{t^2}\n",
"$$\n",
"\n",
" Note that if we normalize so that $Z=(X-\\mu)/\\sigma$, we\n",
"have $\\mathbb{P}(\\vert Z\\vert \\ge k) \\le 1/k^2$. In particular,\n",
"$\\mathbb{P}(\\vert Z\\vert \\ge 2) \\le 1/4$. We can illustrate this\n",
"inequality using Sympy statistics module,\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import sympy\n",
"import sympy.stats as ss\n",
"t=sympy.symbols('t',real=True)\n",
"x=ss.ChiSquared('x',1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" To get the left side of the Chebyshev inequality, we\n",
"have to write this out as the following conditional probability,"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"r = ss.P((x-1) > t,x>1)+ss.P(-(x-1) > t,x<1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" This is because of certain limitations in the statistics module at\n",
"this point in its development regarding the absolute value function. We could\n",
"take the above expression, which is a function of $t$ and attempt to compute\n",
"the integral, but that would take a very long time (the expression is very long\n",
"and complicated, which is why we did not print it out above). This is because\n",
"Sympy is a pure-python module that does not utilize any C-level optimizations\n",
"under the hood. In this situation, it's better to use the built-in cumulative\n",
"density function as in the following (after some rearrangement of the terms),"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"w=(1-ss.cdf(x)(t+1))+ss.cdf(x)(1-t)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" To plot this, we can evaluated at a variety of `t` values by using\n",
"the `.subs` substitution method, but it is more convenient to use the\n",
"`lambdify` method to convert the expression to a function."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"fw=sympy.lambdify(t,w)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Then, we can evaluate this function using something like"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"